ABSTRACT

The performance of processing search queries depends heav- 
ily on the stored index size. Accordingly, considerable re- 
search efforts have been devoted to the development of effi- 
cient compression techniques for inverted indexes. Roughly, 
index compression relies on two factors: the ordering of the 
indexed documents, which strives to position similar docu- 
ments in proximity, and the encoding of the inverted lists 
that result from the ordered stream of documents. Large 
commercial search engines index tens of billions of pages 
of the ever growing Web. The sheer size of their indexes 
dictates the distribution of documents among thousands of 
servers in a scheme called local index-partitioning, such that 
each server indexes only several million pages. Due to engineering and runtime performance considerations, random
distribution of documents to servers is common. However, 
random index-partitioning among many servers adversely 
impacts the resulting index sizes, as it decreases the effec- 
tiveness of document ordering schemes. 
We study the impact of random index-partitioning on doc- 
ument ordering schemes. We show that index-partitioning 
decreases the aggregated size of the inverted lists logarith- 
mically with the number of servers, when documents within 
each server are randomly reordered. On the other hand, the 
aggregated partitioned index size increases logarithmically 
with the number of servers, when state-of-the-art document 
ordering schemes, such as lexical URL sorting and cluster- 
ing with TSP, are applied. Finally, we justify the common 
practice of randomly distributing documents to servers, as 
we qualitatively show that despite its ill-effects on the ensu- 
ing compression, it decreases key factors in distributed query 
evaluation time by an order of magnitude as compared with 
partitioning techniques that compress better. 
1. INTRODUCTION

The searchable Web spans tens of billions of pages, yet 
search engine users expect fresh and relevant search results 
to be delivered within less than a second. Serving simul- 
taneously thousands of queries, web search engines use an 
inverted index, a data structure that supports efficient re- 
trieval of documents containing a set of terms given by the 
user's query. Due to the huge number of web pages and 
the resulting amount of data, the index is partitioned over 
thousands of servers, where each server typically stores and 
processes the inverted index of only several million documents [2] [5] [32] [20]. At query time, the query is sent to all
servers for processing, and the top results retrieved from all 
servers are merged to produce the final results, which are 
returned to the user. 

The inverted index data structure contains a postings list 
for each unique term appearing in the corpus. The postings 
list of term t consists of the list of document identifiers^
(docIds) containing t. The documents within each list are
typically sorted by increasing docId values, and the list is
represented by encoding the gaps (called dGaps) between
successive docIds. Another data structure in an inverted
index is the lexicon, or dictionary, which is a lookup table 
that for each term t in the corpus, points to the postings list 
corresponding to t [5] [32] [3].
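
The structure described above can be sketched in a few lines. This is only an illustrative toy, not the indexing code used in this work: the function names `build_postings` and `to_dgaps`, the whitespace tokenizer, and the four sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to the increasing list of docIds (1-based) containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # set() keeps one posting per (term, document) pair
        for term in set(text.split()):
            postings[term].append(doc_id)  # docIds arrive in increasing order
    return dict(postings)


def to_dgaps(doc_ids):
    """Return the first docId followed by the gaps between successive docIds."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


# Hypothetical four-document corpus; docIds are assigned in list order.
docs = ["tax forms online", "tax law", "weather forecast", "tax forms archive"]
postings = build_postings(docs)
dgaps = {term: to_dgaps(ids) for term, ids in postings.items()}
```

Here the dictionary `postings` plays the role of the lexicon, and `dgaps` holds what an encoder would actually compress.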

Index size has an important effect on system performance. 
In addition to the direct reduction in memory and disk 
space, more compact indexes lead to savings in I/O trans- 
fers and increase the hit rate of memory caches, offering 
an improvement in overall query processing throughput [31] [30]. Consequently, a large body of work has focused on
index compaction and compression methods. The structure 
described above leaves two main degrees of freedom for compression optimization: (a) the assignment of docIds to documents (also referred to as document reordering); and (b) the
actual encoding of the dGaps into bits (also referred to as 
dGap compression [28l|2Tlll|lllIl|3T]). This work focuses 



on the former.

^Although term frequencies and offsets within the document occupy a major portion of modern inverted indexes, we focus here on the document identifiers only.

The basic idea behind an effective docid assignment is to 
place "similar" documents close to each other, hence, po- 
tentially reducing the dGaps since similar documents con- 
tain many common terms. Such effective assignment pro- 
duces highly clustered posting lists where long "runs" of 
small dGaps are separated by large dGaps. In contrast, a 
random assignment of docIds would result in dGaps that
approximately follow a Geometric distribution within each 
postings list [12]. The problem of finding the optimal do- 
cid assignment can be explicitly expressed in closed form, 
but is, unfortunately, NP-hard [6]. Therefore, most works
on document assignment proposed various heuristics that
include approximations to the traveling salesman problem
(TSP), solutions based on clustering algorithms, and solu- 
tions based on the natural URL lexicographical ordering of 
web pages [8] EH [27l [M EQI US] • 

All aforementioned efforts focused on compacting the in- 
verted index of a single server. However, large corpora are 
indexed over thousands of servers, with each server handling only several million documents [2] [32] [20]. In order
to better balance the number of result documents on each server, thereby decreasing query processing time
(see our experiments in Section 6), documents are often distributed randomly among the servers [2] [5] [19] [20]. As first
noted in [30], this index-partitioning operation may have a 
profound effect on document assignment algorithms, since 
similar documents (e.g., pages of the same web host) are 
often routed to different servers. This work examines the 
impact of random index-partitioning on the effectiveness of 
docId assignment algorithms that aim to compress the in- 
verted index. Our experiments are performed on the 25 
million web page TREC .gov2 collection. Our main contri- 
butions are the following: 

• We showcase the interplay between random index-parti- 
tioning and compression. 

• We quantitatively and analytically show that the per- 
formance gap between effective docId assignment heuris- 
tics and ineffective ones diminishes as the index is ran- 
domly partitioned over more servers. For example, 
with dGap Delta encoding, the total length of the in- 
verted lists actually decreases logarithmically with the 
number of partitions when docIds are assigned randomly. On the other hand, partitioning causes that
size to increase logarithmically with the number of par- 
titions when effective docId assignments such as URL 
sorting, and clustering with TSP, are applied. Similar 
trends are reported for dGap block PForDelta encod- 
ing as well. 

• We study experimentally the factors that make the 
URL-based assignment perform well in practice. We 
show that inter-host ordering hardly matters, and that 
clustering pages by hosts with arbitrary intra-host or- 
dering already brings significant compression benefits. 

• We justify the common practice of randomly distribut- 
ing documents to servers, as we qualitatively show that 
despite its ill-effects on the ensuing compression, it de- 
creases key factors in distributed query evaluation time 
by an order of magnitude as compared with the better 
compressing URL-based partitioning. 



The rest of this work is organized as follows. Section 2 provides background and surveys related work. The experimental setup is described in Section 3. Experimental results and analytical insight into the impact of partitioning on index sizes are reported in Sections 4 and 5, respectively. The impact of index partitioning on query processing time is considered in Section 6. Finally, we conclude in Section 7.

2. BACKGROUND AND PRIOR WORK 
2.1 Index Partitioning 

The sheer size of the Web, the enormous number of search 
queries, and the required low latency dictate a distributed
inverted index architecture [2] [32] [5]. To support these requirements, both distribution and replication principles are
applied. Replication (or mirroring) means making enough 
identical copies of the system so that the required query load 
can be served, and is beyond the scope of this work. Distribution refers to the way the inverted index is partitioned
across a collection of nodes.

The two main strategies of partitioning an inverted index 
are local index-partitioning and global index-partitioning [3] [22].
According to the local index-partitioning strategy (or
document-based partition), each node is responsible for a
disjoint subset of documents in the collection. Each search 
query is sent to all nodes, each of which returns its top 
ranking documents for the query. Those lists are then com- 
bined in some way to provide the end result. In the global 
index-partitioning strategy (or term-based partition), terms
are divided into disjoint subsets, such that each node stores 
postings lists only for a subset of terms. 

Due to various theoretical and practical considerations, 
large-scale search engines follow the local inverted index- 
partitioning strategy distributing documents across the nodes 
[S] [3] [25] . Documents can be distributed to nodes using dif- 
ferent policies. For example, the hash distribution policy 
allocates documents to nodes in a random fashion by hashing the documents' URLs to yield a node identifier [19] [20].
Other policies such as round-robin distribution are also possible [E].
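
The two distribution policies mentioned above can be illustrated with a short sketch. It is an assumption-laden toy rather than the routing code of any particular engine: the choice of SHA-1 as the hash, and the names `node_for_url` and `round_robin_node`, are ours and serve only to make the idea concrete.

```python
import hashlib


def node_for_url(url, num_nodes):
    """Hash distribution: derive a node identifier from the document's URL.

    SHA-1 is an arbitrary choice here; any hash that spreads URLs evenly
    would serve the same purpose.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def round_robin_node(arrival_index, num_nodes):
    """Round-robin distribution: the i'th arriving document goes to node i mod m."""
    return arrival_index % num_nodes


# Hypothetical URLs routed to m = 4 nodes.
urls = ["http://www.irs.gov/forms", "http://www.irs.gov/help", "http://www.nasa.gov/"]
hashed = [node_for_url(u, 4) for u in urls]
cyclic = [round_robin_node(i, 4) for i in range(len(urls))]
```

Note that under hash distribution, pages of the same host are in general scattered across nodes, which is exactly the effect studied in this work.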

While random distribution of documents to nodes is used 
by commercial search engines [2] [5] [19] [20], other distribution schemes were considered in distributed information retrieval systems and peer-to-peer networks. For instance, in
[29] (see also [17] for a more recent work) the authors used a 
two-pass K-means clustering algorithm and a KL-divergence 
distance metric to organize a document collection into 100 
topical clusters (or shards) and demonstrated the benefits 
of selectively searching only a few shards per query. Query 
logs were used by the authors of [23] (see also [24] for a more
recent work) to organize a document collection into multiple 
shards. Selectively searching shards defined by these clus- 
ters was found to be more effective than selectively searching 
randomly defined shards. Non-random distribution of docu- 
ments in a distributed search engine was recently considered 
in [18], where the authors treat the routing of documents to
nodes as an online problem in an incremental indexing set- 
ting. Under a model where routed documents are appended 
to the existing index partitions, the authors demonstrate 
a tradeoff between the compression of a locally-partitioned 
index and the balanced distribution of documents from the 
same host across the index partitions. 



2.2 Inverted Index Compression 

As mentioned in Section 1, we consider a simplified
model of an inverted index in which the postings list of term
t holds the docIds containing t, sorted by increasing value.
Denote the list by $d^t_1, d^t_2, \ldots, d^t_{n_t}$, where $d^t_i$ denotes the docId
of the i'th document containing t out of $n_t$ such documents.
The list is actually represented by encoding the first docId
and the sequence of gaps (dGaps) between successive identifiers thereafter, i.e. $d^t_1,\ d^t_2 - d^t_1,\ \ldots,\ d^t_{n_t} - d^t_{n_t-1}$. The two



degrees of freedom available for compressing the size of the 
lists are (a) docId assignment; and (b) dGap encoding. As 
we focus on the former, we start by briefly reviewing the lat- 
ter. dGap encoding techniques aim to compress a sequence 
of integers. The literature contains schemes that encode 
each gap individually, e.g. Gamma, Delta, Golomb-Rice [28]
and Zeta [9] encodings, as well as schemes that encode certain blocks of gaps, e.g. PForDelta [31] [14] and Simple9
[1]. Additionally, the Interpolative Encoding scheme [21] is
applied directly on the docIds rather than their dGaps, and
works well for clustered term occurrences. 

In general, the docId assignment problem seeks a permu- 
tation of the documents that minimizes the inverted-index 
size under a specific dGap encoding scheme. As shown in 
[6], this problem is NP-hard and various heuristics are used 
to provide approximations. 

The size of an inverted index is a function of the dGaps,
which themselves depend on the way docIds are assigned
to documents. All effective dGap encoding techniques represent smaller numbers with fewer bits (roughly logarithmic
in the value of the number). Hence, assigning docIds in a way
that results in smaller dGaps is the key to better compression. This principle drives most works dealing with docId
assignment, which accordingly strive to assign close docIds
to "similar" documents, i.e. documents that share many
terms.

Technically, most works define a graph $G = (D, E)$, where
$D$ is the set of documents, and $E$ is a set of edges representing the similarity between two documents $d_i, d_j \in D$. One
line of work started by [25] traverses the graph G to find
the maximal weight path connecting all the nodes, assigning docIds accordingly. This is equivalent to the NP-hard
traveling salesman problem (TSP). Several TSP approximations were applied for docId assignment in [25] [3] [13]. In
[25], a simple greedy nearest neighbors (GNN) approach is
used to add one edge at a time. To reduce the compu- 
tational load, [7] uses singular value decomposition (SVD) 
to reduce the dimensionality of the term-document matrix. 
To scale up TSP-based schemes, [13] proposes a new framework based on computing TSP on a reduced sparse graph
obtained through locality sensitive hashing. 

In yet another line of work, the nodes of G are clustered 
according to their similarity and close docIds are assigned
to the nodes (documents) within each cluster. A top-down 
approach is used in [S], where the whole collection is re- 
cursively split into sub-collections, inserting "similar" nodes 
into the same sub-collections. Then, the sub-collections are 
merged into an ordered group of nodes. A bottom-up ap- 
proach called k-scan was proposed in [27] . A hybrid method 
which combines k-scan clustering and TSP for intra-cluster 
docId assignment is proposed by [6], and will be used in the
experiments reported in this paper. 

A different approach, which is both highly scalable and 
highly effective, was proposed for Web collections in [26].
It assigns docIds according to the lexicographically sorted
order of the documents' URLs, utilizing the fact that URL 
similarity is a strong indicator of document similarity. The 
scheme was found to perform remarkably well on various 
Web collections indexed as a whole. It was not, to the best 
of our knowledge, examined for partitioned collections. 

In all the aforementioned works, a heuristic of docId assignment or an encoding of dGaps was empirically tested
against several collections and compared to the results of
other works. In contrast, [12] analyzes the compressibility
of a collection whose documents are generated by a simple 
probabilistic model in which terms are chosen independently 
from a given distribution. 

3. EXPERIMENTAL SETUP 

We use the TREC .gov2 Web corpus, a collection of about 
25.2 million pages crawled from the .gov domain, for the experiments. After parsing, tokenizing (with standard English
stopword removal and no stemming) and removing all
empty documents, we are left with 24.9 million documents, 
74.5 million distinct terms, and 5,705.2 million postings (dis- 
tinct term appearances in documents). Whenever we parti- 
tion the corpus over m servers, documents are assigned to 
servers independently and uniformly at random. We then 
apply some docId assignment and dGap encoding schemes 
across all servers. Index sizes are reported using the bits 
per posting metric, defined below. 

3.1 The Bits per Posting Metric 

Let a corpus with $\mathcal{N}$ overall postings be indexed across $m$
partitions, and let $T_i$ denote the set of distinct terms on the
i'th partition. Let $t$ be a term appearing in $n_t$ documents in
some partition, and denote those docIds by $1 \le d^t_1 < d^t_2 < \ldots < d^t_{n_t}$. Then, the overall size of all postings lists on the
i'th partition, $\mathcal{P}_i$, is given by

$$\mathcal{P}_i = \sum_{t \in T_i} S\left(d^t_1,\ d^t_2 - d^t_1,\ \ldots,\ d^t_{n_t} - d^t_{n_t-1}\right),$$

where $S(\cdot)$ is the length (in bits) of encoding the given integer sequence. The overall size of the postings across the m
partitions, $\mathcal{P}$, is defined as

$$\mathcal{P} = \sum_{i=1}^{m} \mathcal{P}_i .$$


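We experiment with Delta and PForDelta encoding schemes.
For Delta encoding,

$$S\left(d^t_1,\ d^t_2 - d^t_1,\ \ldots,\ d^t_{n_t} - d^t_{n_t-1}\right) = \delta(d^t_1) + \sum_{j=2}^{n_t} \delta\left(d^t_j - d^t_{j-1}\right),$$

where $\delta(k)$ is the length (in bits) of the Delta encoding of
the positive integer k:

$$\delta(k) = 1 + \lfloor \log_2 k \rfloor + 2\lfloor \log_2(1 + \lfloor \log_2 k \rfloor) \rfloor .$$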
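
As a concrete reading of the Delta length formula and of the bits per posting metric (without overhead), the following minimal sketch computes both for toy postings lists. The helper names `delta_len`, `postings_list_bits` and `bits_per_posting` are ours, introduced only for illustration; integer bit lengths are used in place of floating-point logarithms to avoid rounding issues.

```python
def delta_len(k):
    """Length in bits of the Delta code of a positive integer k."""
    if k < 1:
        raise ValueError("Delta encoding is defined for positive integers only")
    n = k.bit_length() - 1                          # floor(log2 k)
    return 1 + n + 2 * ((n + 1).bit_length() - 1)   # + 2 * floor(log2(1 + n))


def postings_list_bits(doc_ids):
    """Delta-encoded size of one postings list given its increasing docIds."""
    total, prev = 0, 0
    for d in doc_ids:
        total += delta_len(d - prev)  # first term encodes the first docId itself
        prev = d
    return total


def bits_per_posting(postings_lists):
    """Aggregated size of all lists divided by the overall number of postings."""
    total_bits = sum(postings_list_bits(ids) for ids in postings_lists)
    total_postings = sum(len(ids) for ids in postings_lists)
    return total_bits / total_postings
```

For instance, the list of docIds 1, 2, 4 has dGaps 1, 1, 2, which cost 1, 1 and 4 bits respectively under this formula.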

For the PForDelta encoding scheme, each posting list is processed according to the scheme presented in [31], with block
length of 128 dGaps, and threshold of 90%. Shorter blocks 
at the end of long posting lists and short posting lists, down 
to 64 dGaps are encoded in a similar fashion, while blocks 
of less than 64 dGaps are simply Delta encoded. 

We further define the overhead $OH$ of a partitioned index
as the space taken by the m dictionaries of the individual
partitions. Each entry of the i'th dictionary is a pointer into
the sequence of postings lists on the i'th server, and hence
requires $\log_2 \mathcal{P}_i$ bits^. Overall,

$$OH = \sum_{i=1}^{m} |T_i| \log_2 \mathcal{P}_i .$$

Finally, the bits per posting metric comes in two flavors,
with and without overhead. Those are simply $(\mathcal{P} + OH)/\mathcal{N}$ and
$\mathcal{P}/\mathcal{N}$, respectively.
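
The two flavors of the metric can be computed directly from per-partition totals. The sketch below assumes the inputs are already available as plain lists; the function name `bits_per_posting_with_overhead` and its parameter names are hypothetical and the numbers in the usage lines are made up for illustration.

```python
import math


def bits_per_posting_with_overhead(partition_bits, partition_terms, num_postings):
    """Return (bits per posting without overhead, bits per posting with overhead).

    partition_bits[i]  -- size in bits of all postings lists on partition i
    partition_terms[i] -- number of distinct terms on partition i
    num_postings       -- overall number of postings in the corpus
    """
    total_bits = sum(partition_bits)
    # Each dictionary entry is a pointer of log2(P_i) bits into partition i.
    overhead = sum(
        n_terms * math.log2(bits)
        for n_terms, bits in zip(partition_terms, partition_bits)
    )
    return total_bits / num_postings, (total_bits + overhead) / num_postings


# Made-up example: two partitions of 1024 bits and 10 distinct terms each.
without_oh, with_oh = bits_per_posting_with_overhead([1024, 1024], [10, 10], 512)
```

Since the number of distinct terms per partition shrinks much more slowly than the partition itself, the overhead term grows as the same corpus is spread over more partitions.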
3.2 DocId Assignment Schemes

As stated earlier, we are mainly interested in two aspects: 
(a) studying the impact of random index-partitioning on the 
bits per posting metric, and (b) gaining further insight into 
the power of the URL-based docId assignment scheme. We 
thus experiment with the following five docId assignment
schemes: 

Random assignment (RND): this method serves as a 
baseline for comparison purposes. 

URL-based sorting (URL): following [26], the documents
are sorted lexicographically based on their URLs^ and
docIds are assigned accordingly.

Clustering assignment (KSCAN-TSP): we adopt a procedure presented in [7] where each server's collection
is partitioned into K clusters, and a GNN approximation of TSP is used to assign the docIds within each
cluster. We set the cluster size (and the number of 
clusters) to the square root of the server's corpus size, 
which is known to provide fair results. This heuristic 
represents, in this work, the state-of-the-art of schemes 
that are URL-agnostic. 

Intra-host URL-based sorting (IH-URL): here, the hosts
are randomly ordered, and URL-based ordering is kept 
within the hosts only. This scheme, when compared to 
the conventional URL scheme, should reveal the con- 
tribution of the inter-host lexicographical ordering to 
the power of URL-based docId assignment. 

Intra-host random assignment (IH-RND): here, doc- 
uments of the same host are assigned with consecutive 
docIds, but both the hosts and the documents within
each host are randomly ordered. This scheme should 
reveal whether the power of URL-based assignment 
stems merely from the fact that documents of the same 
host are clustered together, or actually depends on the 
lexicographic ordering within each host. 
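
The four URL-related schemes above (RND, URL, IH-URL, IH-RND) differ only in how they order the documents of a server; docIds then follow the resulting order. The sketch below illustrates this under our own assumptions: the function names are hypothetical, URLs are assumed to carry a scheme such as `http://`, and the sort key inverts the host name components as noted for URL sorting. KSCAN-TSP is omitted since it requires document contents.

```python
import random
from urllib.parse import urlsplit


def host_of(url):
    return urlsplit(url).hostname or ""


def url_sort_key(url):
    """Sort key with inverted host components, e.g. www.irs.gov -> gov.irs.www."""
    parts = urlsplit(url)
    inverted_host = ".".join(reversed((parts.hostname or "").split(".")))
    return (inverted_host, parts.path, parts.query)


def assign_rnd(urls, rng):
    """RND: documents are ordered uniformly at random."""
    order = list(urls)
    rng.shuffle(order)
    return order


def assign_url(urls):
    """URL: documents are ordered lexicographically by URL."""
    return sorted(urls, key=url_sort_key)


def assign_intra_host(urls, rng, sort_within_host):
    """IH-URL (sort_within_host=True) or IH-RND (False): hosts randomly ordered."""
    groups = {}
    for url in urls:
        groups.setdefault(host_of(url), []).append(url)
    hosts = sorted(groups)
    rng.shuffle(hosts)                      # random inter-host order
    order = []
    for host in hosts:
        members = sorted(groups[host], key=url_sort_key)
        if not sort_within_host:
            rng.shuffle(members)            # IH-RND: random intra-host order
        order.extend(members)
    return order


def doc_ids(order):
    """docIds are the 1-based positions in the chosen order."""
    return {url: i for i, url in enumerate(order, start=1)}
```

In every intra-host variant the documents of a host receive consecutive docIds; only the order of hosts and the order inside each host change.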

When comparing URL-agnostic docId assignment schemes 
(represented here by KSCAN-TSP) to the URL sorting scheme
over partitioned indexes, one hypothesis comes to mind:
URL-agnostic schemes should outperform URL assignment 
when the corpus is highly partitioned, since they have the 
degree of freedom to arrange documents by similarity that 
transcends diluted URL patterns. 



4. EXPERIMENTAL RESULTS 

The bits per posting measure is plotted as a function of the
number of nodes, without and with overhead, in Figures 1a
and 1b, respectively. The curves are plotted for the URL,
IH-URL, IH-RND, and RND docId assignment schemes using the full .gov2 corpus and Delta encoding. Figure 1a
demonstrates that without overhead, the aggregated size decreases with the number of nodes for the RND assignment
and increases for the URL-based schemes (i.e., URL, IH-URL, IH-RND). In particular, the ratio between the sizes of
the RND and the URL assignments decreases from 2.2 when
no partitioning is applied, to 1.45 when the corpus is partitioned over m = 10'^ nodes. When the overhead is included,
the sizes achieved by all schemes increase with the number
of nodes, although the performance of URL-based schemes
degrades at a faster rate than that of the RND scheme. As
can be seen, in the region of interest, the curves are approximately linear in log m. Beyond this region, as the number
of nodes increases, the URL-based curves will coincide with
that of the RND, and in the limit where each document is
placed on a different node the number of bits per posting of
all schemes goes to one.

In Figures 2a and 2b, the bits per posting measure without
and with overhead is plotted as a function of the number of
nodes, respectively. The curves are plotted for the URL, IH-URL, IH-RND, RND, and KSCAN-TSP docId assignment
schemes using 3 million pages taken as a URL-continuous
bulk from the .gov2 corpus^ and Delta encoding. It can be seen
that the compression achieved by the KSCAN-TSP scheme 
behaves similarly to that of the URL based schemes, and 
increases with the number of nodes. We note that although 
KSCAN-TSP is expected to perform as the RND scheme 
in the limit, where the number of nodes is large, one could 
expect that KSCAN-TSP will degrade more gracefully with 
the number of nodes than URL. This is because KSCAN-TSP
(and other state-of-the-art schemes) have an additional de- 
gree of freedom over URL sorting, in their ability to reorder 
the local documents after partitioning. However, as seen 
here, both URL and KSCAN-TSP degrade at similar rates. 

Comparing Figures 1 and 2, produced for the full .gov2
corpus and for the 3 million document sub-corpus respectively, using Delta encoding, reveals that the shapes of the
curves and the relations between them are similar. This
strengthens our conjecture that the same behavior also holds
for web scale collections.

Turning to PForDelta encoding, Figures 3a and 3b plot
the bits per posting measure as a function of the number of
nodes without and with overhead, respectively. The curves
are plotted for the URL, IH-URL, IH-RND, and RND docId
assignment schemes using the full .gov2 corpus and PForDelta
encoding. In general, the trends visible for Delta encoding
and all docId assignment schemes (Figure 1) are also visible here for the PForDelta encoding curves. Nevertheless,
comparing Figures 1 and 3, it is observed that while the RND
assignment curve decreases at a lower rate than that of the
Delta encoding curve, the URL-based sorting curves increase at a higher rate than those of the Delta encoding^.

^For simplicity, we assume that individual postings, as well as inverted lists, can start on arbitrary bit boundaries.

^The host name components are first inverted, see for details.

^A smaller corpus is used due to run time considerations of the KSCAN-TSP scheme.

^Also visible is the superiority of Delta encoding over the specific variant of the PForDelta scheme used here, which is consistent with the results reported in [30].

(a) Aggregated size without overhead (.gov2 corpus)

(b) Aggregated size with overhead (.gov2 corpus)
Figure 1: Bits per posting as a function of the number of nodes for different docId assignment schemes and Delta encoding applied to the .gov2 corpus, (a) without and (b) with overhead.

Another observation, visible in all figures, relates to the
true nature of URL sorting. By merely clustering each host's
documents together, the IH-RND scheme achieves 75% to 
85% (for Delta encoding over the range of node numbers) of 
the performance improvement of URL sorting over random 
assignment. Moreover, the performance of IH-URL is almost 
identical to that of URL. To be precise, URL is slightly bet- 
ter than IH-URL (about 1% on the average) over the range 
of node numbers. We conclude that the impressive effective- 
ness of URL sorting for Web corpora such as .gov2, stems 
mostly from the act of clustering documents of the same host 
together (i.e., IH-RND scheme). Keeping the lexicograph- 
ical order within each host is the secondary contributor to 
the effectiveness of the URL scheme, and when combined 
with host clustering (i.e., IH-URL), it provides almost iden- 
tical performance to that of URL sorting. On the other 
hand, keeping the lexicographical order across hosts has a 
negligible effect, and hosts can be placed randomly with- 
out degrading the URL scheme's effectiveness. These con- 
clusions, while of little practical implication, provide some 
insight into the true nature of URL sorting. 

We note that these results were all generated under random document distribution to nodes (see Section 2.1). Experimental results (not presented here) with round-robin
distribution did not produce qualitatively different results.
In addition, experimental results (also not presented here)
reveal a small variance between multiple runs, indicating that the
corpora used are large enough for self-averaging to dominate. Hence, multiple runs are redundant and all the presented
results are from single-run experiments.

5. ANALYTICAL INSIGHT ON RESULTS 

This section provides analytical and illustrative explana- 
tions to some of our experimental results. In particular, 
we prove that for random docId assignment and individual dGap logarithmic encoding (e.g., Delta encoding), the
average aggregated index size (ignoring overhead) is a non-increasing function of the number of partitions. Conversely,
for URL (and IH-URL) assignment and individual dGap
logarithmic encoding, we demonstrate that partitioning increases the aggregated size. We note that the impact of
index-partitioning on docId assignment and PForDelta en- 
coding is much harder to explain since this encoding scheme 
works in blocks of dGaps, and is left for further study. 

Index-Partitioning and Random docId Assignment 

Our model for index-partitioning under random docId assignment is as follows. Let there be $|D|$ documents and $m$
nodes, and assume for simplicity that $m$ divides $|D|$ and
that the documents are evenly distributed across the nodes.
We first draw uniformly at random (u.a.r.) a permutation
$\pi$ over the documents, and then draw an equal partitioning
(denoted by $g$) of $m$ sets of $|D|/m$ documents each, also u.a.r.
There are $|D|! / \left((|D|/m)!\right)^m$ such partitions. The document sets
get assigned to the servers, with the internal order on each
server respecting (being consistent with) $\pi$. We aim to prove
that the following expectation, denoted $\Delta^m$, is non-negative:

$$\Delta^m = E_{g,\pi}\left[\mathcal{P}(\pi) - \mathcal{P}^m(\pi, g)\right] \ge 0,$$

where $\mathcal{P}(\pi)$ denotes the length of all postings lists when the
documents are ordered by $\pi$ and indexed on a single node,
and $\mathcal{P}^m(\pi, g)$ denotes the aggregated length of all postings
lists when the documents are partitioned by $g$ into $m$ nodes,
with the internal order in each node respecting $\pi$. Now,

$$\Delta^m = E_g E_\pi\left[\mathcal{P}(\pi) - \mathcal{P}^m(\pi, g)\right] = \frac{\left((|D|/m)!\right)^m}{(|D|!)^2} \sum_{g} \sum_{\pi} \left[\mathcal{P}(\pi) - \mathcal{P}^m(\pi, g)\right] .$$



Looking at the inner sum, observe that for a fixed partition $g$ and every permutation $\pi$ there exists a single permutation $\tilde{\pi}$ that represents the concatenation of the $m$ partial
permutations. Furthermore, for a fixed $g$, the mapping between $\pi$ and $\tilde{\pi}$ is 1:1 and onto. We now define the m-slice
partitioning of a $|D|$-sized permutation, denoted $g_m$, as the
process of assigning the first $|D|/m$ documents to the first node,
and so on, until assigning the last $|D|/m$ documents to the m'th
node. By definition, applying $g$ on $\pi$ is equivalent to applying
$g_m$ on $\tilde{\pi}$, and so

$$\Delta^m = \frac{\left((|D|/m)!\right)^m}{(|D|!)^2} \sum_{g} \sum_{\pi} \left[\mathcal{P}(\pi) - \mathcal{P}^m(\tilde{\pi}, g_m)\right] .$$

As $\pi$ goes over all $|D|!$ permutations, so does $\tilde{\pi}$, and thus

$$\Delta^m = \frac{\left((|D|/m)!\right)^m}{(|D|!)^2} \sum_{g} \sum_{\tilde{\pi}} \left[\mathcal{P}(\tilde{\pi}) - \mathcal{P}^m(\tilde{\pi}, g_m)\right] = \frac{1}{|D|!} \sum_{\tilde{\pi}} \left[\mathcal{P}(\tilde{\pi}) - \mathcal{P}^m(\tilde{\pi}, g_m)\right] .$$

(a) Aggregated size without overhead (3M-gov2 corpus)

(b) Aggregated size with overhead (3M-gov2 corpus)
Figure 2: Bits per posting as a function of the number of nodes for different docId assignment schemes and Delta encoding applied to a bulk of 3 million URL-continuous documents from the .gov2 corpus, (a) without and (b) with overhead.




To conclude the proof, we argue that $\forall \pi$, $\mathcal{P}(\pi) - \mathcal{P}^m(\pi, g_m) \ge 0$.
Since the transformation involves only slicing, all intra-slice dGaps remain the same for the original and partitioned
indexes (or slices), while dGaps bridging across slices are
shorter within the partitioned indexes. In expectation, the
bridging dGaps are halved by the slicing process, and assuming a logarithmic encoding function (e.g. Delta encoding),
about 1 bit is gained on account of each bridging dGap.
As the number of nodes (or slices) m increases, more dGaps
bridge across slices. Hence, the expected difference $\Delta^m$ does
not decrease with m.
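
The slicing argument can be checked numerically on a single postings list. The sketch below is only an illustration under our own assumptions (the helper names, the corpus size of 2^16 documents, the 2000 sampled docIds and the fixed seed are all arbitrary); it compares the Delta-encoded size of a randomly populated list on a single node against its aggregated size after m-slice partitioning.

```python
import random


def delta_len(k):
    """Length in bits of the Delta code of a positive integer k."""
    n = k.bit_length() - 1                          # floor(log2 k)
    return 1 + n + 2 * ((n + 1).bit_length() - 1)


def list_bits(doc_ids):
    """Delta-encoded size of a postings list given its increasing docIds."""
    total, prev = 0, 0
    for d in doc_ids:
        total += delta_len(d - prev)
        prev = d
    return total


def sliced_bits(doc_ids, num_docs, m):
    """Aggregated size when docIds 1..num_docs are cut into m equal slices."""
    slice_len = num_docs // m
    total = 0
    for j in range(m):
        lo = j * slice_len
        # Re-base docIds so that each slice is numbered 1..slice_len locally.
        local = [d - lo for d in doc_ids if lo < d <= lo + slice_len]
        total += list_bits(local)
    return total


rng = random.Random(7)                     # fixed seed, arbitrary choice
num_docs, m = 1 << 16, 64
doc_ids = sorted(rng.sample(range(1, num_docs + 1), 2000))
single_node_bits = list_bits(doc_ids)
partitioned_bits = sliced_bits(doc_ids, num_docs, m)
```

Intra-slice dGaps are untouched by `sliced_bits`, while the first dGap of every slice is measured from the slice boundary rather than from the previous posting, so `partitioned_bits` can never exceed `single_node_bits`.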

Index-Partitioning and URL Sorting 

Ideally, a postings list following URL-based assignment includes runs of small dGaps separated by long dGaps. To
illustrate the impact of index-partitioning into m nodes on
the performance of URL sorting, consider a specific posting list which begins with a single large dGap of $N_1$, followed by a run of $R$ dGaps of 1, another large dGap of $N_2$,
and another run of $R$ dGaps of 1, with $N_1, N_2 \gg R \gg m$.
Under Delta encoding, the size of the postings list is
$\delta(N_1) + \delta(N_2) + 2R\,\delta(1)$. It is easily verified that the average aggregated size after partitioning is approximated by
$m[\delta(N_1/m) + \delta(N_2/m) + 2(R/m)\,\delta(1)]$. Hence, the difference in the postings list sizes after and before partitioning
into m nodes is approximately $m[\delta(N_1/m) + \delta(N_2/m)] - (\delta(N_1) + \delta(N_2))$. Since Delta encoding behaves logarithmically, partitioning increases the average overall size by
approximately $(m - 1)(\log_2 N_1 + \log_2 N_2) - 2m \log_2 m$. Obviously, this oversimplified example does not represent all
cases, but it teaches us that for URL sorting (and IH-URL
sorting), the encoding of the partitioned large dGaps of the
original list causes its aggregated size to increase.
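
The example above can be instantiated with concrete numbers. The sketch below is a hedged illustration, not a reproduction of our experiments: it uses round-robin distribution as a deterministic stand-in for random distribution, and the values $N_1 = N_2 = 2^{20}$, $R = 1024$, $m = 8$, as well as all helper names, are assumptions chosen for the example.

```python
def delta_len(k):
    """Length in bits of the Delta code of a positive integer k."""
    n = k.bit_length() - 1                          # floor(log2 k)
    return 1 + n + 2 * ((n + 1).bit_length() - 1)


def list_bits(doc_ids):
    """Delta-encoded size of a postings list given its increasing docIds."""
    total, prev = 0, 0
    for d in doc_ids:
        total += delta_len(d - prev)
        prev = d
    return total


def clustered_list(n1, n2, r):
    """docIds with a dGap of n1, r dGaps of 1, a dGap of n2, and r dGaps of 1."""
    first_run = list(range(n1, n1 + r + 1))
    start2 = first_run[-1] + n2
    return first_run + list(range(start2, start2 + r + 1))


def round_robin_bits(doc_ids, m):
    """Aggregated size when document d goes to node (d-1) mod m."""
    per_node = [[] for _ in range(m)]
    for d in doc_ids:
        per_node[(d - 1) % m].append((d - 1) // m + 1)  # local, 1-based docId
    return sum(list_bits(local) for local in per_node)


n1 = n2 = 1 << 20
r, m = 1024, 8
ids = clustered_list(n1, n2, r)
before = list_bits(ids)
after = round_robin_bits(ids, m)
```

Each of the m nodes still pays for two large dGaps of roughly $N_1/m$ and $N_2/m$, which cost only a few bits less than the original large dGaps, so `after` exceeds `before` even though the runs of 1-valued dGaps are encoded as cheaply as before.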

6. DOCUMENT DISTRIBUTION SCHEMES 
AND QUERY PROCESSING TIME 

The previous sections demonstrated the deleterious effect 
of random distribution of documents to nodes, on the aggre- 
gated index sizes. This section examines the impact of doc- 
ument distribution schemes on other factors affecting query 
processing time, and demonstrates the significant benefits of 
random distribution, which make it the industry standard
[2] [5] [20] [19]. In particular, we qualitatively show that random distribution results in faster query processing than that
achieved by the better compressing URL-based distribution 
scheme. 

6.1 Surrogates for Query Processing Time 

In order for our ensuing experiments and qualitative anal- 
ysis to be independent of specific retrieval algorithms or 
computational platforms, we use surrogate measures that 
are highly correlated with query evaluation time, for both 
disjunctive and conjunctive query models. In what follows,
let $q = \{t_1, \ldots, t_k\}$ be a k-term query, and let $\ell(t)$ denote the
number of postings in term t's postings list. In disjunctive
queries, disregarding various pruning and early termination
schemes, retrieval algorithms must scan all lists to fully evaluate the query. Hence, a surrogate measure for the running
time of a disjunctive query on a particular index partition
would be $\sum_{t \in q} \ell(t)$. For a locally-partitioned index among
m nodes, query evaluation (again, disregarding timeout or
pruning policies) must wait for the slowest partition to finish
evaluating the query. Hence, we approximate the running
time of q on m nodes in disjunctive semantics, $T_d(q)$, by

$$T_d(q) = \max_{j=1,\ldots,m} \sum_{t \in q} \ell_j(t),$$

where $\ell_j(t)$ denotes the length of t's postings list on the
j'th partition. Moving to conjunctive models, queries are
typically evaluated by join-flavored algorithms [10] [11] that
rely on the ability to skip portions of postings lists where
matches are known not to exist [28]^. Therefore, our surrogate for q's running time on a particular index partition is
the length of the postings list of its rarest term, $\min_{t \in q} \ell(t)$.
In a distributed setting, the slowest partition dictates that

$$T_c(q) \approx \max_{j=1,\ldots,m} \min_{t \in q} \ell_j(t) .$$

Figure 3: Bits per posting as a function of the number of nodes for different docId assignment schemes and PForDelta encoding applied to the .gov2 corpus, (a) without and (b) with overhead.


We stress that we do not claim that these measures equal 
query running times - only that for most retrieval algorithms 
on RAM-resident indexes, they represent reasonable surro- 
gates that are correlated with running times. 
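
The two surrogates can be sketched in a few lines of Python. This is only an illustration of the definitions in this subsection, not the code used in the experiments; the data layout (one dictionary per partition, mapping a term to the length of its local postings list) and the function names are assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Assumed layout: partitions[j][t] is the number of postings of term t
# on partition j; a missing term means an empty local postings list.
Partition = Dict[str, int]


def t_disjunctive(query: Sequence[str], partitions: List[Partition]) -> int:
    """Max over partitions of the summed local list lengths of the query terms."""
    return max(sum(p.get(t, 0) for t in query) for p in partitions)


def t_conjunctive(query: Sequence[str], partitions: List[Partition]) -> int:
    """Max over partitions of the local list length of the locally rarest term."""
    return max(min(p.get(t, 0) for t in query) for p in partitions)


if __name__ == "__main__":
    # Toy, made-up posting counts for a two-term query on three partitions.
    parts = [{"a": 5, "b": 2}, {"a": 1, "b": 7}, {"a": 3}]
    print(t_disjunctive(["a", "b"], parts))  # 8: the second partition is slowest
    print(t_conjunctive(["a", "b"], parts))  # 2: the first partition is slowest
```

Both functions take the maximum over partitions last, which mirrors the assumption that a distributed query waits for its slowest partition.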

6.2 Experimental Evaluation 

We again use the TREC .gov2 corpus (see Section 3),
and distribute its documents to servers using random dis- 
tribution (RND), and two flavors of URL-based distribu- 
tion. First, vanilla URL distribution (URL), where all docu- 
ments are ordered lexicographically according to their URL 
and then evenly sliced and routed to servers; second, IH-URL
distribution, where hosts are randomly ordered and same-host
documents are lexicographically sorted by URL before being
evenly sliced and routed to servers.
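
The three routing schemes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the experimental code: RND is modeled as independent uniform assignment of each document to a server, "lexicographic" is taken to mean plain string order on the full URL, the host is taken to be the URL's network location, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def slice_evenly(ordered: List[str], m: int) -> List[List[str]]:
    """Cut an ordered list into m contiguous slices whose sizes differ by at most one."""
    n = len(ordered)
    bounds = [(i * n) // m for i in range(m + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(m)]


def distribute_rnd(urls: List[str], m: int, seed: int = 0) -> List[List[str]]:
    """RND (assumed variant): each document goes to a uniformly random server."""
    rng = random.Random(seed)
    servers: List[List[str]] = [[] for _ in range(m)]
    for url in urls:
        servers[rng.randrange(m)].append(url)
    return servers


def distribute_url(urls: List[str], m: int) -> List[List[str]]:
    """URL: sort all documents by URL, then slice evenly."""
    return slice_evenly(sorted(urls), m)


def distribute_ih_url(urls: List[str], m: int, seed: int = 0) -> List[List[str]]:
    """IH-URL: random host order, URL order within each host, then slice evenly."""
    rng = random.Random(seed)
    by_host: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    hosts = list(by_host)
    rng.shuffle(hosts)
    ordered = [u for h in hosts for u in sorted(by_host[h])]
    return slice_evenly(ordered, m)
```

In both URL-based schemes a server receives a contiguous run of the ordering, so documents of one host tend to land together; RND scatters them across servers.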



(Footnote to Section 6.1: we ignore the small overhead that skipping
mechanisms add to the lengths of the postings lists.)



We use the 150 queries of TREC topics 701-850, whose
average length is 3.1 terms, and report the average
$T_c = \frac{1}{150} \sum_{q=701}^{850} T_c(q)$ and the similarly
defined average $T_d$ resulting from the three document
distribution schemes over all queries, to qualitatively
compare their average query processing time.

Figure 4 plots the $T_c$ and $T_d$ curves for the two query types
as functions of the number of servers, for the three document
distribution schemes RND, URL, and IH-URL. The figure reveals
the significant benefit of RND over the URL-based assignment
schemes in terms of query processing time, and furthermore
that the difference between the curves induced by RND and the
URL-based schemes increases with the number of servers. In
particular, RND induces $T_c$ and $T_d$ curves that are an order
of magnitude lower (i.e., faster) than those induced by the
URL-based schemes at $m = 1000$ servers. A closer inspection of
the RND curves reveals that their slope is approximately $-1$
on a log-log scale. Hence, the $T_c$ and $T_d$ induced by the RND
scheme are inversely proportional to the number of servers:
$T_c, T_d \propto 1/m$. Finally, note the similar
performance demonstrated by the two URL-based schemes, 
which is explained by the weak inter-host document similar- 
ity already observed in Section O 

The degradation in query processing time obtained by the
URL-based distribution schemes can be intuitively explained
by the fact that same-host documents are similar (which
is good for reducing the index size) and share many terms.
Hence, placing them on the same partition yields unbalanced
postings lists, which increases query processing time due to
the maximum operation included in the calculation of both
$T_c$ and $T_d$.

6.3 Analytical Evaluation 

This subsection analytically explains why RND's $T_c$ and $T_d$
curves are inversely proportional to the
number of servers m. For simplicity, we assume the docu- 
ment generation model of [T^], in which terms are picked to 
document independently. Hence, for a disjunctive query $q$,
we can equate $T_d(q)$ to the occupancy of the most occupied
among $m$ urns (servers) when $b_q = \sum_{t \in q} df(t)$ balls
(postings) are randomly tossed into the urns ($df(t)$ denotes
the document frequency of term $t$ in the entire corpus) [16].

Figure 4: Average query processing time surrogates
vs. number of servers for different document routing
schemes.

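Proposition 1. For any $0 < \epsilon < 1$ and $\delta_\epsilon < \delta < 2e - 1$,

$$\mathrm{Prob}\left( T_d(q) \in \left[ \frac{b_q}{m}, \frac{b_q}{m}(1 + \delta) \right] \right) > 1 - \epsilon ,$$

where $\delta_\epsilon = \sqrt{\frac{4m}{b_q} \log \frac{m}{\epsilon}}$.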



Proof. Let $b_q$ balls be tossed randomly into $m$ urns, and
let $x_j$ be the number of balls in the $j$th urn. In addition,
denote the expected number of balls in an urn by $\mu = b_q/m$,
and let $x_{\max} = \max_{j=1,\ldots,m} x_j$ be the maximal number of
balls falling into some urn. Setting $\alpha = \mu(1 + \delta)$ for some
$\delta > 0$ we can write

$$\Pr(x_{\max} > \alpha) = \Pr\Big( \bigcup_{j=1}^{m} \{ x_j > \alpha \} \Big) \le m \Pr(x_1 > \alpha) < m e^{-\mu \delta^2 / 4} ,$$

where the first inequality is due to the union bound, and the
second inequality is achieved by applying Chernoff's bound
and holds for $\delta < 2e - 1$. Forcing the last term of the
previous expression to be smaller than $0 < \epsilon < 1$, we have
that $\delta$ must also satisfy

$$\delta > \sqrt{\frac{4m}{b_q} \log \frac{m}{\epsilon}} .$$

The proof is completed by recalling that $x_{\max} \ge \mu$. $\Box$
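
The bound can be checked empirically with a small balls-into-urns simulation. The Python sketch below is illustrative only: it takes the threshold sqrt((4m/b_q) log(m/eps)) from the proof with the logarithm assumed to be natural, sets delta one percent above that threshold, and uses hypothetical values for b_q, m and eps. The function names are not from the paper.

```python
import math
import random


def delta_eps(b_q: int, m: int, eps: float) -> float:
    """Threshold on delta from the proof: sqrt((4m / b_q) * ln(m / eps))."""
    return math.sqrt((4.0 * m / b_q) * math.log(m / eps))


def max_occupancy(b_q: int, m: int, rng: random.Random) -> int:
    """Toss b_q balls uniformly into m urns and return the fullest urn's count."""
    counts = [0] * m
    for _ in range(b_q):
        counts[rng.randrange(m)] += 1
    return max(counts)


def within_bounds(b_q: int, m: int, eps: float, rng: random.Random) -> bool:
    """Check that one simulated maximum lies in [mu, mu * (1 + delta)]."""
    mu = b_q / m
    delta = 1.01 * delta_eps(b_q, m, eps)  # just above the threshold
    if delta >= 2 * math.e - 1:
        raise ValueError("delta is outside the range where the bound applies")
    return mu <= max_occupancy(b_q, m, rng) <= mu * (1 + delta)


if __name__ == "__main__":
    # Hypothetical sizes: 200,000 postings spread over 100 servers.
    print(within_bounds(200_000, 100, 0.01, random.Random(7)))
```

For these sizes the interval is roughly 14% wide, so the fullest urn stays close to the mean b_q/m, which is the property the subsequent argument relies on.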

The average number of postings $b_q$ for the $N = 150$ TREC
topics 701-850 is about 2.6 x 10* , whereas the number of 



servers in this experiment does not exceed $10^3$. Therefore,
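we can apply Prop. 1 and write

$$T_d = \frac{1}{N} \sum_q T_d(q) \approx \frac{1}{mN} \sum_q b_q = \frac{1}{m} \left( \frac{1}{N} \sum_q \sum_{t \in q} df(t) \right) .$$

Hence, $T_d$ is inversely proportional to the number of servers
$m$, as observed in Fig. 4.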

The expression $T_c$ corresponding to conjunctive queries
involves a max-min operation, which complicates the exact
analysis. Therefore, we analyze an upper bound which is
obtained by only considering the rarest term of each query.
In this case, $T_c(q)$ equals the maximum urn occupancy of a
simple urn model where $b_q = \min_{t \in q} df(t)$ balls are randomly
tossed into $m$ urns. Since the average of $b_q$ for the $N = 150$
queries of TREC topics 701-850 is about lO'^ - still at least 
two orders of magnitude over the number of servers $m$, we
can apply Prop. 1 and write

$$T_c = \frac{1}{N} \sum_q T_c(q) \approx \frac{1}{mN} \sum_q b_q = \frac{1}{m} \left( \frac{1}{N} \sum_q \min_{t \in q} df(t) \right) .$$

Hence, as in the disjunctive case, $T_c$ is inversely proportional
to the number of servers $m$, as observed in Fig. 4. Note
that this approximate upper bound is tight, since it has a
slope of $-1$ on a log-log scale and the same constant as
the experimental curve at $m = 1$.
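
The inverse proportionality can also be illustrated by simulating the upper-bound urn model for several values of m and fitting a line on a log-log scale. The Python sketch below does this under explicit assumptions: the rarest-term document frequencies are made-up values rather than the TREC statistics, postings are assigned to servers independently and uniformly, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def max_occupancy(balls: int, urns: int, rng: random.Random) -> int:
    """Toss `balls` balls uniformly into `urns` urns; return the fullest urn's count."""
    counts = [0] * urns
    for _ in range(balls):
        counts[rng.randrange(urns)] += 1
    return max(counts)


def average_tc_bound(rarest_df: Sequence[int], m: int, rng: random.Random) -> float:
    """Average over queries of the simulated upper bound on T_c(q) for m servers."""
    return sum(max_occupancy(b, m, rng) for b in rarest_df) / len(rarest_df)


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    servers: List[int] = [1, 10, 100, 1000]
    rarest_df = [50_000, 100_000, 150_000]  # hypothetical df of each query's rarest term
    curve = [average_tc_bound(rarest_df, m, rng) for m in servers]
    print(loglog_slope(servers, curve))  # close to -1 for these sizes
```

The fitted slope drifts slightly above -1 as m approaches the number of postings, because the fullest urn then exceeds the mean by a larger relative margin.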

7. CONCLUSIONS 

We studied the impact of random index-partitioning on 
the performance of various docid assignment techniques, 
and demonstrated the deleterious effect of random index- 
partitioning in terms of the aggregated size of the parti- 
tioned index. We conjecture that our findings, based on the 
TREC .gov2 corpus and backed by some analysis, also hold 
at web scale - that randomized index-partitioning generates 
local collections that state-of-the-art ordering schemes can 
compress with relatively minor improvement over random 
ordering. The main reason is that random index-partitioning
causes pages of the same web host to be scattered
over many nodes, resulting in local collections that are
"sparse" in terms of URL continuity and that include few 
documents having high similarity with each other. There- 
fore, it follows that from a pure index size perspective, global 
index-partitioning where terms (instead of documents) are 
partitioned between nodes will compress better than the in- 
dustry standard of randomized local index-partitioning. We 
also show via experimental evaluation that most of the ef- 
fectiveness of URL sorting is achieved by merely clustering 
same host documents together. Moreover, we demonstrate 
that while URL sorting the documents within the hosts does 
yield additional improvement, keeping the lexical URL or- 
dering of the hosts brings only negligible benefit. Lastly, 
we demonstrate the benefits of the industry standard ran- 
dom partitioning of documents to servers in terms of query 
processing time, over URL-based partitioning schemes. 